Collecting University Rankings for Comparison Using Web Extraction and Entity Linking Techniques
Abstract
University rankings order higher-education institutions by combinations of factors. They are compiled by various organizations, such as news media, websites, governments, academics and private corporations. Because of the large financial and other interests involved, worldwide university rankings have recently received increasing attention. The rankings are based on different criteria and collect data in different ways, so the position assigned to the same institution diverges widely from one list to another. To compare rankings and draw safe conclusions about their reliability, the data must first be collected from the web sites of the different ranking lists. In this paper we present this first step towards university ranking comparison: we discuss in detail how we developed a Prolog application, called URank, that collects the data by a) extracting them from the various ranking list web sites using web data extraction techniques, b) uniquely identifying the university entities in these lists by linking them to the DBpedia linked open data set, and c) constructing a combined data set by merging the individual ranking list data sets, using each university's DBpedia URI as the primary key.
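Step c) of the abstract, merging per-list data on a shared DBpedia URI, can be illustrated with a minimal sketch. This is not the authors' URank code, which is written in Prolog and not shown here. It is a Python illustration that assumes each ranking list has already been extracted and linked, so that it is available as a mapping from DBpedia URI to rank. The function name `merge_rankings`, the list names `list_a` and `list_b`, and all rank values are hypothetical placeholders, not real ranking data.

```python
from collections import defaultdict

# DBpedia resource namespace, used to build the URIs that act as primary keys.
DBR = "http://dbpedia.org/resource/"


def merge_rankings(ranking_lists):
    """Merge per-list rankings into one table keyed by DBpedia URI.

    ranking_lists maps a list name to a dict {dbpedia_uri: rank}.
    Returns a dict {dbpedia_uri: {list_name: rank}}; a university that
    is missing from a list simply has no entry for that list.
    """
    merged = defaultdict(dict)
    for list_name, ranks in ranking_lists.items():
        for uri, rank in ranks.items():
            merged[uri][list_name] = rank
    return dict(merged)


# Hypothetical, made-up ranks for illustration only.
ranking_lists = {
    "list_a": {
        DBR + "Harvard_University": 1,
        DBR + "University_of_Oxford": 2,
    },
    "list_b": {
        DBR + "University_of_Oxford": 1,
        DBR + "ETH_Zurich": 2,
        DBR + "Harvard_University": 3,
    },
}

combined = merge_rankings(ranking_lists)
for uri, ranks in sorted(combined.items()):
    print(uri, ranks)
```

Keying on the URI rather than on the university name means that spelling variants of the same institution across lists collapse into one row, provided the linking step resolved them to the same DBpedia resource.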
Similar resources
Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator
This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters for the matrix comparator for obtaining the best linking results a...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Entity Linking at Web Scale
This paper investigates entity linking over millions of high-precision extractions from a corpus of 500 million Web documents, toward the goal of creating a useful knowledge base of general facts. This paper is the first to report on entity linking over this many extractions, and describes new opportunities (such as corpus-level features) and challenges we found when entity linking at Web scale...
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
Entity Extraction from the Web with WebKnox
This paper describes a system for entity extraction from the web. The system uses three different extraction techniques which are tightly coupled with mechanisms for retrieving entity rich web pages. The main contributions of this paper are a new entity retrieval approach, a comparison of different extraction techniques and a more precise entity extraction algorithm. The presented approach allo...